Abstract

We present a component-based, trainable system for detecting
frontal and near-frontal views of faces in still gray images.
The system consists of a two-level hierarchy of Support Vector
Machine (SVM) classifiers. On the first level, component classifiers
independently detect components of a face. On the second level, a
single classifier checks if the geometrical configuration of the
detected components in the image matches a geometrical model of a
face. We propose a method for automatically learning components by
using 3-D head models. This approach has the advantage that no manual
interaction is required for choosing and extracting components.
Experiments show that the component-based system is significantly
more robust against rotations in depth than a comparable system
trained on whole face patterns.


1.  Introduction 

Over the past ten years face detection has been thoroughly studied
in computer vision research, mainly for two reasons. First, face
detection has a number of interesting applications: it can be part
of a face recognition system, a surveillance system, or a video-based
computer/machine interface. Second, faces form a class of visually
similar objects, which simplifies the generally difficult task of
object detection.

In the following we give a brief overview of face detection
techniques in still gray images. Since there are no color and motion
cues available, face detection boils down to a pure pattern
recognition task. A method for detecting faces in gray images by
combining clustering techniques with neural networks is proposed in
[15]. It generates face and non-face prototypes by clustering a set
of training images. The distances between an input pattern and the
prototypes are classified by a Multi-Layer Perceptron. In [8] frontal
faces are detected by a polynomial SVM classifier. A system able to
deal with rotations in the image plane was proposed by [10]. It
consists of two neural networks, one for estimating the orientation
of the face, and another for detecting the derotated faces. The
recognition step was improved [11] by arbitrating between
independently trained networks of identical structure. A naive
Bayesian approach was taken in [12]. The method determines the
empirical probabilities of the occurrence of small rectangular
intensity patterns within the face image. In [13] the system was
expanded to deal with frontal and profile views of faces by adding a
separate classifier trained on profile views. Another probabilistic
approach which detects small parts of faces is proposed in [6]. Local
feature extractors are used to detect the eyes, the corner of the
mouth, and the tip of the nose. The geometrical configuration of
these features is matched with a model configuration by conditional
search. A related method using statistical models is published in
[9]. Local features are extracted by applying multi-scale and
multi-orientation filters to the input image. The responses of the
filters on the training set are modeled as Gaussian distributions.
Detecting components has also been applied to face recognition. In
[18] local features are computed on the nodes of an elastic grid.
Separate templates for the eyes, the nose, and the mouth are matched
in [1, 2]. Finally, a component-based approach for people detection
using SVMs was proposed in [7].

There are three basic ideas behind part- or component-based
detection of objects. First, some object classes can be described
well by a few characteristic object parts and their geometrical
relation. Second, the patterns of some object parts might vary less
under pose changes than the pattern belonging to the whole object.
Third, a component-based approach might be more robust against
partial occlusions than a global approach. The two main problems of a
component-based approach are how to choose the set of discriminatory
object parts and how to model their geometrical configuration. The
above-mentioned approaches either manually define a set of components
and model their geometrical configuration or uniformly partition the
image into components and assume statistical independence between
them.

We propose a technique for learning relevant components from 3-D
head models. The technique starts with a set of small seed regions
that are gradually grown by minimizing a bound on the expected error
probability of an SVM. This approach has the advantage that no manual
interaction is required for choosing and extracting components from
the training set. Once the components have been determined, we train
a system consisting of a two-level hierarchy of SVM classifiers. On
the first level, component classifiers independently detect facial
components. On the second level, a single classifier checks if the
geometrical configuration of the detected components in the image
matches a geometrical model of a face.

The outline of the paper is as follows: Section 2 gives a brief
overview of SVM learning. In Section 3 we describe the
component-based face detection system. A method for automatically
extracting components from synthetic face images is presented in
Section 4. Section 5 contains experimental results and a comparison
between the global and component-based approaches. Section 6
concludes the paper.


2.  Learning  with  Support  Vector  Machines 


In this section we outline the basic theory of SVMs [16]. SVMs
perform pattern recognition for two-class problems by determining the
separating hyperplane¹ with maximum distance to the closest points of
the training set. These points are called support vectors. If the
data is not linearly separable in the input space, a non-linear
transformation $\Phi(\cdot)$ can be applied which maps the data
points $\mathbf{x} \in \mathbb{R}^n$ of the input space into a high
(possibly infinite) dimensional space $\mathbb{R}^p$ which is called
feature space. The data in the feature space is then separated by the
optimal hyperplane as described above. The mapping $\Phi(\cdot)$ is
implemented in the SVM classifier by a kernel function
$K(\cdot, \cdot)$ which defines an inner product in $\mathbb{R}^p$,
i.e. $K(\mathbf{x}, \mathbf{t}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{t})$.
The decision function of the SVM has the form:

$$f(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}), \tag{1}$$

where $l$ is the number of data points in the training set, and
$y_i \in \{-1, 1\}$ is the class label of the data point
$\mathbf{x}_i$. The coefficients $\alpha_i$ in Eq. (1) are the
solution of a quadratic programming problem [16].
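
As an illustration of Eq. (1), the following is a minimal sketch of how a kernel decision function could be evaluated once the support vectors, labels, and coefficients are known. The function names, the polynomial kernel form $(1 + \mathbf{u} \cdot \mathbf{v})^d$, and the sign threshold at zero are assumptions made for this example and are not taken from the paper.

```python
def linear_kernel(u, v):
    """Inner product in the input space (linear SVM)."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    """Assumed polynomial kernel form (1 + u.v)^degree, for illustration only."""
    return (1.0 + linear_kernel(u, v)) ** degree


def decision_function(x, support_vectors, labels, alphas, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) as in Eq. (1)."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


def classify(x, support_vectors, labels, alphas, kernel=linear_kernel):
    """Assumed decision rule: class +1 if f(x) >= 0, otherwise -1."""
    score = decision_function(x, support_vectors, labels, alphas, kernel)
    return 1 if score >= 0 else -1


# Toy example with two hand-picked "support vectors".
svs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [1.0, 1.0]
print(decision_function((2.0, 0.0), svs, ys, alphas))
print(classify((-3.0, 1.0), svs, ys, alphas))
```

Only points with non-zero coefficients contribute to the sum, so in practice the loop runs over the support vectors alone.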

Let $M$ be twice the distance of the support vectors to the
hyperplane. This quantity is called the margin and is given by:

$$M = \frac{2}{\sqrt{\sum_{i=1}^{l} \alpha_i}}. \tag{2}$$

1 SVM theory also includes the case of non-separable data, see [16].


The margin is an indicator of the separability of the data. In fact,
the expected error probability of the SVM, $EP_{err}$, satisfies the
following bound [16]:

$$EP_{err} \le \frac{1}{l} E\left[\frac{D^2}{M^2}\right], \tag{3}$$


with $D$ being the diameter of the smallest sphere containing the
data points in the feature space. Later in the paper we will attempt
to minimize the ratio $D^2/M^2$ to automatically extract components.
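
To make the ratio in the bound concrete, here is a small sketch for a linear classifier. It assumes a hyperplane in canonical form, so that the margin is $2/\|\mathbf{w}\|$, and it replaces the smallest enclosing sphere by a sphere centered at the data centroid, which only gives an upper bound on $D$. Both choices, and all function names, are assumptions of this example rather than the paper's procedure.

```python
import math


def margin(w):
    """Margin M = 2 / ||w|| of a canonical separating hyperplane (assumption)."""
    return 2.0 / math.sqrt(sum(c * c for c in w))


def diameter_upper_bound(points):
    """Diameter of a centroid-centered sphere containing all points.

    This is an upper bound on the diameter D of the smallest enclosing
    sphere, not the exact value.
    """
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dim)]
    radius = max(math.dist(p, centroid) for p in points)
    return 2.0 * radius


def bound_ratio(w, points):
    """Approximate D^2 / M^2; smaller values indicate better separability."""
    return diameter_upper_bound(points) ** 2 / margin(w) ** 2


pts = [(0.0, 0.0), (2.0, 0.0)]
print(bound_ratio((1.0, 0.0), pts))  # wide margin, small ratio
print(bound_ratio((2.0, 0.0), pts))  # narrower margin, larger ratio
```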


3.  Component-based  face  detection 
3.1.  Motivation 


We briefly mentioned in the introduction that a global approach is
highly sensitive to changes in the pose of an object. Fig. 1
illustrates this problem for the simple case of linear
classification. The result of training a linear classifier on faces
can be represented as a single face template, schematically drawn in
Fig. 1 a). Even for small rotations the template clearly deviates
from the rotated faces as shown in Fig. 1 b) and c). The
component-based approach tries to avoid this problem by independently
detecting parts of the face. In Fig. 2 the eyes, nose, and the mouth
are represented as single templates. For small rotations the changes
in the components are small compared to the changes in the whole face
pattern. Slightly shifting the components is sufficient to achieve a
reasonable match with the rotated faces.


Figure 1. Matching with a single template. The schematic template of
a frontal face is shown in a). Slight rotations of the face in the
image plane b) and in depth c) lead to considerable discrepancies
between template and face.


Figure 2. Matching with a set of component templates. The schematic
component templates for a frontal face are shown in a). Shifting the
component templates can compensate for slight rotations of the face
in the image plane b) and in depth c).


3.2. Overview of the System

An overview of our two-level component-based classifier is shown in
Fig. 3. On the first level, component classifiers independently
detect components of the face. In the example shown these components
are the eyes, the nose and the mouth. We used linear SVM classifiers,
each of which was trained on a set of extracted facial components and
on a set of randomly selected non-face patterns. The components were
automatically extracted from synthetic 58x58 face images generated
from 3-D head models. On the second level the geometrical
configuration classifier performs the final face detection by
linearly combining the results of the component classifiers. Given a
58x58 window, the maximum continuous outputs of the component
classifiers within rectangular search regions² around the expected
positions of the components are used as inputs to the geometrical
configuration classifier. The search regions have been calculated
from the mean and standard deviation of the locations of the
components in the training images. We also provide the geometrical
classifier with the precise positions of the detected components
relative to the upper left corner of the 58x58 window. Overall we
have three values per component classifier that are propagated to the
geometrical classifier. The output of the system is computed as
follows: We denote the input image as $\mathbf{x}$ and the extracted
components as $\{\mathbf{x}^i\}_{i=1}^{N}$. The decision function of
a component classifier is then given by:


$$f^i(\mathbf{x}^i) = \sum_{j} \alpha^i_j y_j K^i(\mathbf{x}^i_j, \mathbf{x}^i),$$


where $K^i$ is the kernel used by the $i$-th classifier. The
geometrical configuration classifier $F(\mathbf{x})$ is a linear
combination of the outputs of the component classifiers and the image
locations $(h^i, v^i)$ of the detected components:

$$F(\mathbf{x}) = \sum_{i=1}^{N} \mathbf{c}^i \cdot \left(f^i(\mathbf{x}^i), h^i, v^i\right)^T.$$

The coefficient vectors $\mathbf{c}^i$ are learned from the examples:

$$\left\{\left(f^1(\mathbf{x}^1_j), h^1_j, v^1_j, \ldots, f^N(\mathbf{x}^N_j), h^N_j, v^N_j, y_j\right)\right\}_{j=1}^{l}$$

where the label $y_j$ is 1 for faces and $-1$ for non-face examples
and $l$ is the number of examples.
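
The second-level computation can be sketched as follows: for each component, take the maximum classifier output inside its search region together with the position of that maximum, and combine the resulting three values linearly. This is a minimal sketch under stated assumptions: the score maps are taken as already computed, the region format `(h_min, v_min, h_max, v_max)` and all function names are hypothetical, and no threshold or multi-scale search is included.

```python
def best_in_region(score_map, region):
    """Return (score, h, v) of the maximum output inside a search region.

    score_map[v][h] holds the component classifier output at column h and
    row v of the window; region = (h_min, v_min, h_max, v_max), inclusive.
    """
    h_min, v_min, h_max, v_max = region
    best = None
    for v in range(v_min, v_max + 1):
        for h in range(h_min, h_max + 1):
            score = score_map[v][h]
            if best is None or score > best[0]:
                best = (score, h, v)
    return best


def configuration_score(score_maps, regions, coefficients):
    """Linear combination of (output, h, v) triples over all components.

    coefficients[i] is a 3-vector weighting the output and the position
    of component i.
    """
    total = 0.0
    for score_map, region, c in zip(score_maps, regions, coefficients):
        f, h, v = best_in_region(score_map, region)
        total += c[0] * f + c[1] * h + c[2] * v
    return total


# Toy 3x3 "window" with two components sharing one score map.
toy_map = [[0, 1, 0], [0, 5, 2], [0, 0, 9]]
toy_regions = [(0, 0, 1, 1), (2, 2, 2, 2)]
toy_coeffs = [(1.0, 0.0, 0.0), (0.5, 1.0, 1.0)]
print(configuration_score([toy_map, toy_map], toy_regions, toy_coeffs))
```

Passing the positions along with the outputs lets the second-level classifier penalize configurations in which strong component responses occur at implausible locations.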


2 To  account  for  changes  in  the  size  of  the  components,  the  outputs  were 
determined  over  multiple  scales  of  the  input  image.  In  our  tests,  we  set  the 
range  of  scales  to  [0.75, 1.2]. 


[Figure 3 diagram. First level: component classifiers (outputs of the
eye, nose, and mouth classifiers). Second level: detection of the
configuration of components.]

Figure 3. System overview of the component-based classifier using
four components. On the first level, windows of the size of the
components (solid lined boxes) are shifted over the face image and
classified by the component classifiers. On the second level, the
maximum outputs of the component classifiers within predefined search
regions (dotted lined boxes) and the positions of the components are
fed into the geometrical configuration classifier.


3.3.  Training  Data 

Extracting face patterns is usually tedious and time-consuming work
that has to be done manually. Taking the component-based approach we
would have to manually extract each single component from all images
in the training set. This procedure would only be feasible for a
small number of components. For this reason we used textured 3-D head
models [17] to generate the training data. By rendering the 3-D head
models we could automatically generate large numbers of faces in
arbitrary poses and with arbitrary illumination. In addition to the
3-D information we also knew the 3-D correspondences for a set of
reference points shown in Fig. 4. These correspondences allowed us to
automatically extract facial components located around the reference
points. Originally we had 7 textured head models acquired by a 3-D
scanner. Additional head models were generated by 3-D morphing
between all pairs of the original head models. The heads were rotated
between -30° and 30° in depth. The faces were illuminated by ambient
light and a single directional light pointing towards the center of
the face; some examples are shown in Fig. 5. The position of the
light varied between -30° and 30° in azimuth and between 30° and 60°
in elevation. Overall, we generated 2,457 face images of size 58x58.

The negative training set initially consisted of 10,209 58x58
non-face patterns randomly extracted from 502 non-face images. We
then applied bootstrapping to enlarge the training data by non-face
patterns that look similar to faces. To do so we trained a single,
linear SVM classifier and applied it to the previously used set of
502 non-face images. The false positives (FPs) were added to the
non-face training data to build the final non-face training set of
size 13,654.
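
The bootstrapping step can be sketched as one round of collecting false positives from a pool of non-face patterns. In this sketch the SVM training is replaced by a simple class-mean linear classifier so that the example stays self-contained; that stand-in, the function names, and the toy data are assumptions for illustration, not the classifier used in the paper.

```python
def train_mean_difference(positives, negatives):
    """Stand-in for linear SVM training (assumption): a linear classifier
    whose hyperplane lies halfway between the two class means."""
    dim = len(positives[0])
    mean_pos = [sum(p[k] for p in positives) / len(positives) for k in range(dim)]
    mean_neg = [sum(n[k] for n in negatives) / len(negatives) for k in range(dim)]
    w = [a - b for a, b in zip(mean_pos, mean_neg)]
    midpoint = [(a + b) / 2.0 for a, b in zip(mean_pos, mean_neg)]
    bias = -sum(wk * mk for wk, mk in zip(w, midpoint))
    return lambda x: sum(wk * xk for wk, xk in zip(w, x)) + bias


def bootstrap_negatives(positives, negatives, pool, train=train_mean_difference):
    """Add the false positives found in a non-face pool to the negative set."""
    classifier = train(positives, negatives)
    # Every pool pattern is a non-face, so any positive score is a false positive.
    false_positives = [
        x for x in pool if classifier(x) > 0 and x not in negatives
    ]
    return negatives + false_positives


faces = [(1.0, 1.0), (1.2, 0.8)]
non_faces = [(-1.0, -1.0), (-0.8, -1.2)]
pool = [(0.9, 0.7), (-2.0, -2.0), (0.1, 0.2)]
enlarged = bootstrap_negatives(faces, non_faces, pool)
print(len(enlarged))
```

The enlarged negative set concentrates on patterns close to the decision boundary, which is what makes the retrained classifier harder to fool.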


Figure 4. Reference points on the head models which were used for
3-D morphing and automatic extraction of facial components.


Figure  5.  Examples  of  synthetic  faces. 


4.  Learning  Components 

A main problem of the component-based approach is how to choose the
set of discriminatory object parts. For the class of faces an obvious
choice of components would be the eyes, the nose and the mouth.
However, for other classes of objects it might be more difficult to
manually define a set of intuitively meaningful components. Instead
of manually choosing the components it would be more sensible to
choose the components automatically based on their discriminative
power and their robustness against pose and illumination changes.

Training a large number of classifiers on components of random size
and location is one way to approach the problem of automatically
determining components. The components can be ranked and selected
based on the training results of the classifiers, e.g. the bound on
the expected error probability. However, this method is
computationally expensive in the training stage.

An alternative to using a large set of arbitrary components is to
specifically generate discriminative components. Following this idea,
we developed a method that automatically determines rectangular
components from a set of synthetic face images. The algorithm starts
with a small rectangular component located around a pre-selected
point in the face (e.g. center of the left eye). Note that we could
locate the same facial point in all face images since we knew the
point-by-point correspondences between the 3-D head models. The
component is extracted from all synthetic face images to build a
training set of positive examples. We also generate a training set of
non-face patterns that have the same rectangular shape as the
component. After training an SVM on the component data we estimate
the performance of the SVM based on the estimated upper bound on the
expected probability of error. According to Eq. (3) we calculate:


$$\rho = \frac{D^2}{M^2}, \tag{4}$$


where $D$ is the diameter of the smallest sphere³ in the feature
space $\mathbb{R}^p$ containing the support vectors, and $M$ is the
margin given by Eq. (2). After determining $\rho$ we enlarge the
component by expanding the rectangle by one pixel into one of the
four directions (up, down, left, right). Again, we generate training
data, train an SVM and determine $\rho$. We do this for expansions
into all four directions and finally keep the expansion which
decreases $\rho$ the most. This process is continued until the
expansions into all four directions lead to an increase of $\rho$. In
our experiments we started with 14 seed regions of size 5x5, most of
them located in the vicinity of the eyes, nose and mouth. Fig. 6
shows the results after component growing; the size of the components
is given in Table 1.
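
The growing procedure can be sketched as a greedy search over rectangles. In the sketch below, `rho` is a hypothetical callback that stands for "extract the component, train an SVM, and return the estimate of $D^2/M^2$"; the rectangle format, the image-boundary check, the stop on ties, and the toy `rho` used in the demonstration are all assumptions of this example.

```python
def rho_cube_approximation(width, height, margin):
    """Estimate D^2 / M^2 with D^2 replaced by the dimensionality
    p = width * height, as described for a linear kernel in footnote 3."""
    return (width * height) / margin ** 2


def grow_component(seed, rho, image_size=58):
    """Greedily grow a rectangle (left, top, right, bottom), inclusive.

    At each step the rectangle is expanded by one pixel in each of the
    four directions, and the expansion with the smallest rho is kept.
    Growing stops when no expansion decreases rho.
    """
    moves = [(0, -1, 0, 0), (0, 0, 0, 1), (-1, 0, 0, 0), (0, 0, 1, 0)]
    current, current_rho = seed, rho(seed)
    while True:
        candidates = []
        for delta in moves:  # up, down, left, right
            rect = tuple(c + d for c, d in zip(current, delta))
            left, top, right, bottom = rect
            if left < 0 or top < 0 or right >= image_size or bottom >= image_size:
                continue  # expansion would leave the image
            candidates.append((rho(rect), rect))
        if not candidates:
            return current, current_rho
        best_rho, best_rect = min(candidates)
        if best_rho >= current_rho:
            return current, current_rho
        current, current_rho = best_rect, best_rho


def toy_rho(rect):
    """Toy criterion with its minimum at a 7x5 rectangle (illustration only)."""
    left, top, right, bottom = rect
    width, height = right - left + 1, bottom - top + 1
    return (width - 7) ** 2 + (height - 5) ** 2 + 1


final_rect, final_rho = grow_component((20, 20, 24, 24), toy_rho)
print(final_rect, final_rho)
```

Each growth step requires four training runs, one per direction, so the cost scales with the number of pixels added rather than with the number of possible rectangles.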


3 In our experiments we replaced $D^2$ in Eq. (4) by the
dimensionality $p$ of the feature space. This is because our data
points lay within a $p$-dimensional cube of side length 1, so the
smallest sphere containing the data had radius equal to
$\sqrt{p}/2$, i.e. $D^2 = p$. This approximation was made mainly for
computational reasons, since computing $D$ requires solving an
optimization problem [8].


Components              Width   Height
Eyebrows                19      15
Eyes                    17      17
Between eyes            18      16
Nose                    15      20
Nostrils                22      12
Cheeks                  21      20
Mouth                   31      15
Lip                     13      16
Corners of the mouth    18      11

Table 1. Size of the learned components.


Figure 6. The fourteen learned components. The crosses mark the
centers of the components.


5. Experiments


In our experiments we compared the component-based system to a
classifier trained on the whole face pattern. The component system
consisted of 14 linear SVM classifiers for component detection and a
single linear SVM as geometrical classifier. The whole face
classifier was a single linear SVM trained on gray values of the
whole face pattern. The training data for both classifiers consisted
of 2,457 synthetic gray face images and 13,654 non-face gray images
of size 58x58.

The positive test set consisted of 1,834 faces rotated between about
-30° and 30° in depth. The faces were manually extracted from the CMU
PIE database [14]. The negative test set consisted of 24,464
difficult non-face patterns that were collected by a fast face
detector [5] from web images⁴. The false positive (FP) rate was
calculated relative to the number of non-face test images. The
comparison between SVM whole face classifiers (linear and polynomial
kernels) and a component classifier consisting of 14 linear SVM
component classifiers and a linear SVM geometrical configuration
classifier is shown in Fig. 7. For benchmarking we also added the ROC
curve of a second-degree polynomial kernel SVM trained on 19x19 real
face images. This face detector is described and evaluated in detail
in [3] and performed amongst the best face detection systems on the
CMU test set [10] including frontal and near-frontal face images. The
component system outperforms all whole face systems. Some detection
results generated by the component system are shown in Fig. 8.

4 The test database together with a detailed description of the
experiments [4] can be found on the MIT/CBCL web page.


[Figure 7 plot. Title: Components vs. Whole Face.
Training (58x58): 2,457 synthetic faces, 13,654 non-faces.
Training (19x19): 10,038 faces, 36,220 non-faces.
Test: subset of CMU PIE, 1,834 faces, 24,464 non-faces.]

Figure 7. ROC curves for whole face classifiers and the 14-component
classifier.
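
For readers who want to reproduce curves of this kind, the sketch below shows one common way to compute ROC points from classifier outputs, with the false positive rate taken relative to the number of non-face test patterns as described in the text. The function name and the toy scores are hypothetical; the paper does not specify how its thresholds were sampled.

```python
def roc_points(face_scores, nonface_scores):
    """Return (false positive rate, detection rate) pairs, one per threshold.

    Every distinct score is used once as a threshold; a pattern counts as
    detected when its score is at least the threshold.
    """
    thresholds = sorted(set(face_scores) | set(nonface_scores), reverse=True)
    points = []
    for t in thresholds:
        detection_rate = sum(s >= t for s in face_scores) / len(face_scores)
        fp_rate = sum(s >= t for s in nonface_scores) / len(nonface_scores)
        points.append((fp_rate, detection_rate))
    return points


# Toy classifier outputs, made up for illustration.
toy_faces = [0.9, 0.4, 0.8]
toy_nonfaces = [0.1, 0.5, 0.3, 0.2]
for fp_rate, detection_rate in roc_points(toy_faces, toy_nonfaces):
    print(fp_rate, detection_rate)
```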


6.  Conclusion 

We presented a component-based system for face detection using SVM
classifiers. The system performs the detection by means of a
two-level hierarchy of classifiers. On the first level, the component
classifiers independently detect parts of the face. On the second
level, the geometrical configuration classifier combines the results
of the component classifiers and performs the final detection step.
Experiments on real face images show a significant improvement in the
classification performance compared to a whole face detection system.
We also proposed a region growing method that involves measures
derived from SVM theory to learn relevant components from a set of
3-D head models. The use of 3-D head models allowed us to
automatically extract components and to arbitrarily change the
illumination and the viewpoint. Both the component-based
classification system and the technique for component learning can be
applied to other object detection tasks in computer vision.


Figure 8. Faces detected by the 14-component system.